Colloquium | 13/01/2021

Collaborating on Reproducible Code… !?


Collaborating:

You and your collaborators (including your
future self) can access the code and its history

Reproducible:

Your code runs and produces identical results
at different time points and on different systems

Schedule

  1. Working in different contexts: RStudio Projects
  2. Dynamic document generation: RMarkdown
  3. Version control: Git + GitHub
  4. Package management: renv
  5. Containerization: Docker
  6. Where to start?

0. Kudos


1. Working in different contexts: RStudio Projects

1. Working in different contexts: RStudio Projects - What & Why?

  • What it does:
    • Lets you work in multiple different contexts (projects), e.g. one for each experiment
    • Each project has its own working directory, workspace, history, and source documents
    • Each project is associated with a folder on your computer (= working directory)
  • Why it helps:
    • Have a separate, shareable working environment for each experiment
    • Keep all the files associated with a project together — data, scripts, results, figures
    • Work on multiple projects at once, each associated with its own packages (and package versions), loaded data, etc.
    • Use only relative paths
    • Necessary basis for version control

1. Working in different contexts: RStudio Projects – How?

  • In RStudio: File > New Project > …

1. Working in different contexts: RStudio Projects – Version 1: Create new project

1. Working in different contexts: RStudio Projects – Version 2: Create from existing directory

1. Working in different contexts: RStudio Projects – Version 3: Create from version control (Git)

1. Working in different contexts: RStudio Projects – Open and manage projects

1. Working in different contexts: RStudio Projects – Tricks and troubleshooting

  • Relative paths: path separator characters vary across systems & anchor points differ depending on contexts
    • Use the here-package (Müller, 2020) to define relative paths within the project: read.csv(here::here("data", "file_I_want.csv"))

2. Dynamic document generation: RMarkdown

2. Dynamic document generation: RMarkdown - What & Why?

  • What it does:
    • Creates dynamic documents with embedded chunks of code (R, Python, Julia, Stan, …), computed results, written text etc. (= LaTeX)
    • Markdown files can be exported to documents (docx, rtf), presentations, PDFs, websites (html), … e.g. using the knitr (Xie, 2015, 2020) and tinytex (Xie, 2015, 2020; for PDFs) packages
    • R code is dynamically rendered, and can be given in separate chunks (```{r}) or inline (`r …`)
  • Why it helps:
    • Simple language (≠ LaTeX)
    • Integrates directly with statistical software (RStudio)
    • Saves code AND output in one file
    • Reduces copy & paste errors: reported results consistent with actual results

2. Dynamic document generation: RMarkdown - How?

  • Installation: install.packages("rmarkdown") (Allaire et al., 2017)
  • Install ‘knitr’ package for easy access: install.packages("knitr") (Xie 2015, 2020)
  • Open a markdown file (.Rmd): File > New File > R Markdown

2. Dynamic document generation: RMarkdown - Tricks & troubleshooting

  • You don’t have RStudio installed: install Pandoc (http://pandoc.org) before installing rmarkdown
  • Lengthy R code chunks: Install the knitr package (Xie, 2014, 2015, 2020) to customize chunks and the knitting process
    • {r cache=TRUE, message=FALSE, warning=FALSE, results="hide", error=TRUE}
    • or set the options globally with knitr::opts_chunk$set()
  • Knit to pdf: You need a LaTeX installation
    • TinyTeX (Xie, 2010) is a light-weight, cross-platform distribution (install.packages("tinytex"); tinytex::install_tinytex())
    • Separate code chunks by a blank line
  • Knit older .R code files: Put #' in front of any top-level prose, including the header, or use:
#/*
rmarkdown::render(input = rstudioapi::getSourceEditorContext()$path,
                  output_format = rmarkdown::github_document(),
                  knit_root_dir = getwd()) #*/

3. Version control: Git + GitHub

3. Version control: Git + GitHub - What & Why?

  • What it does:
    • Tracks changes to files (data and code) over time: Sequence of “snapshots” (commits)
    • Lets you “go back in time”: Recall older versions or revert the entire project
    • Changes between commits can be compared
    • Organized in repositories: Collection of all snapshots
    • GitHub: Popular server for sharing materials (privately or publicly) and collaborating via git (also: GitLab and others)
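As a rough illustration of snapshots, comparisons, and going back in time, the sketch below builds a throwaway repository in a temporary folder. The file name notes.txt, the tag name preprint, and the demo identity are placeholders chosen for this example; they are not part of the talk's materials.

```shell
# Throwaway repository in a temporary folder; nothing leaves your computer.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Demo User"           # placeholder identity, stored in this repo only
git config user.email "demo@example.com"

# First snapshot (commit)
echo "Exclude trials with RT below 200 ms" > notes.txt
git add notes.txt
git commit -q -m "Add exclusion rule"
git tag preprint                           # optional name for this snapshot

# Second snapshot
echo "Exclude trials with RT above 2000 ms" >> notes.txt
git commit -q -a -m "Add upper RT limit"

# Compare the two snapshots
git diff preprint HEAD

# "Go back in time": show the file as it was in the tagged snapshot
git show preprint:notes.txt

# List all snapshots, newest first
git log --oneline
```

The tag is only a convenience here: any commit hash shown by git log could be used in its place.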

3. Version control: Git + GitHub - What & Why?

  • Why it helps:
    • Keep things organized and track changes
    • Clean up code
    • Language agnostic
    • (Remote) backup
    • Work together with collaborators (even simultaneously and in parallel: branches, merges, pull requests) and with your future self
    • Web interface for your project and to track issues
    • Easily connected e.g. to osf.io

3. Version control: Git + GitHub – Installation

  • Register an account with GitHub: https://github.com/
  • (Update R, RStudio, and your packages: update.packages(ask = FALSE, checkBuilt = TRUE))
  • Is Git installed? Open your shell (“Terminal” in RStudio or on Mac, “Command Prompt” on Windows), and type: git --version. If the answer is “git: command not found”:
  • Install Git - Mac: macOS offers to install the command line developer tools automatically. Click “Install”. If you don’t get the offer, type: xcode-select --install. Restart R.
  • Install Git - Windows: Install “Git Bash” (https://gitforwindows.org). Accept default settings. When asked about “Adjusting your PATH environment”, select “Git from the command line and also from 3rd-party software”. Restart R.
  • Configure Git: In the (Git Bash) shell, type
    • git config --global user.name 'your name'
    • git config --global user.email 'email associated with your GitHub account'
    • git config --global --list (Check whether everything worked)
  • Optional: Install a Git client. Find more info e.g. here: https://happygitwithr.com/git-client.html

3. Version control: Git + GitHub – Vocabulary

  • Vocabulary - Git:
    • Repo(sitory): Directory of files that Git manages holistically
    • Commit: Snapshot of all files in the repository, at a specific moment, each with a unique identifier (hash code or SHA) and description (commit message)
    • Diff: Set of differences between (any) two commits
    • Tag: Specific name for a certain snapshot (optional), e.g. “v1.0.3”, “preprint”, “submitted”
  • Vocabulary - GitHub
    • Push: Send your local Git commits to GitHub
    • Pull: Compare and update your local Git with GitHub
    • Merge conflict: Git can’t be certain how to jointly apply diffs from two commits to their common parent. Resolve by picking manually, avoid by pushing often.
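A merge conflict is easier to recognize once you have produced one on purpose. The following sketch does so in a throwaway repository; the file settings.txt, the branch name collaborator, and the demo identity are made up for illustration.

```shell
# Throwaway repository in a temporary folder
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Demo User"           # placeholder identity, stored in this repo only
git config user.email "demo@example.com"

# Common parent commit
echo "alpha = 0.05" > settings.txt
git add settings.txt
git commit -q -m "Add settings"
main_branch=$(git symbolic-ref --short HEAD)   # "main" or "master", depending on your Git setup

# A collaborator changes the line on their own branch
git checkout -q -b collaborator
echo "alpha = 0.01" > settings.txt
git commit -q -a -m "Lower alpha"

# Meanwhile, you change the same line on the main branch
git checkout -q "$main_branch"
echo "alpha = 0.10" > settings.txt
git commit -q -a -m "Raise alpha"

# Git cannot decide which version wins, so the merge stops with a conflict
merge_status=0
git merge collaborator || merge_status=$?  # a non-zero status signals the conflict
cat settings.txt                           # both versions, separated by conflict markers

# Resolve by picking manually, then stage and commit the result
echo "alpha = 0.01" > settings.txt
git add settings.txt
git commit -q -m "Resolve conflict: keep the collaborator's alpha"
```

In a real project you would edit the file by hand (or in RStudio) and delete the conflict markers rather than overwrite the file with echo.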

3. Version control: Git + GitHub – Code along

  • Go to https://github.com/ and log in
  • Click “New repository”
    • Decide between “private” or “public”. Initialize with a README. Accept default for everything else.
    • Click “Create repository”
    • Copy the HTTPS URL of the repository

3. Version control: Git + GitHub – Code along

  • Clone your GitHub repo to your computer: Type git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git (your link) in the shell (Terminal in RStudio or on Mac, Git Bash on Windows)
    • Make this repo your working directory (cd YOUR-REPOSITORY), list its files (ls), display README (head README.md), get info on its connection to GitHub (git remote show origin)
    • Make a local change: Add a new line to README, and verify change is noticed:
echo "This is the first change to my repo" >> README.md
git status
  • Stage (add), commit (commit -m "YOUR-COMMIT-MESSAGE"), and push change. You may be asked for your username and password.
git add -A
git commit -m "A commit from my local computer"
git push
  • (Clean up: Delete your local repo (cd .., and then rm -rf YOUR-REPOSITORY/))
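To rehearse the same clone, commit, push, and pull cycle without touching GitHub, a local "bare" repository can stand in for the remote. This is only a practice setup under that assumption; the folder names laptop and office and the demo identity are hypothetical.

```shell
work=$(mktemp -d)

# A bare repository stands in for the GitHub remote
git init -q --bare "$work/remote.git"

# "Laptop": clone, change, stage, commit, push
git clone -q "$work/remote.git" "$work/laptop" 2>/dev/null   # hides the "empty repository" warning
cd "$work/laptop"
git config user.name "Demo User"           # placeholder identity, stored in this repo only
git config user.email "demo@example.com"
echo "# My project" > README.md
git add -A
git commit -q -m "Add README"
git push -q origin HEAD                    # push the current branch to the remote

# "Office computer": a second clone receives the pushed commit
git clone -q "$work/remote.git" "$work/office"

# Another change on the laptop ...
echo "This is the first change to my repo" >> README.md
git commit -q -a -m "A commit from my laptop"
git push -q origin HEAD

# ... arrives at the office with a pull
cd "$work/office"
git pull -q --ff-only
cat README.md
```

With GitHub as the remote, only the URL changes; the sequence of commands stays the same.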

3. Version control: Git + GitHub – Code along

  • Clone your repository to RStudio
    • File > New Project > Select “Version Control” > Select “Git” > Enter your repository URL: https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git

3. Version control: Git + GitHub - Tricks & Troubleshooting

  • GitHub: No long-term guarantee for availability of service (is commercial)
    • Mirror snapshots on HU servers/OSF/Zenodo/FigShare/…
  • GitHub: .md files will be displayed like HTML, CSV files will have a nice layout, README.md files act like the landing page. Use internal links to other files.

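One possible way to produce a snapshot file for such a mirror is git archive, which packs the files of a single commit into one archive. The sketch below runs in a throwaway repository; the archive name snapshot.zip and the demo identity are placeholders, and uploading the file to OSF, Zenodo, etc. remains a manual step here.

```shell
# Throwaway repository in a temporary folder
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Demo User"           # placeholder identity, stored in this repo only
git config user.email "demo@example.com"
echo "# My project" > README.md
git add README.md
git commit -q -m "Add README"

# Pack the files of the latest commit into one archive (without the Git history)
git archive --format=zip --output=snapshot.zip HEAD
ls -l snapshot.zip
```

Replacing HEAD with a tag such as “submitted” would archive that named snapshot instead.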
4. Package management: renv

4. renv – What & Why?

  • What it does:
    • Creates a project-specific library of packages in the project folder
    • Overrides install.packages() to install packages in this local library
    • Keeps track of package versions in the renv.lock file


  • Why it helps:
    • Keeps package versions untouched by other projects
    • Allows you to revert to the previous state when an update has broken your analysis
    • Makes it easier to share package versions with your collaborators (e.g., via GitHub)
    • Can also keep track of Python packages

4. renv – How?

  1. Install renv just like any other R package via install.packages("renv")
  2. Initialize your project library via renv::init()
    (Instead, you can also select “Use renv with this project” during project creation)
  3. After successfully installing or updating packages, use renv::snapshot()
  4. If you want to revert to previous state (e.g., if an update caused problems), use renv::restore()

4. renv – Code along

  • Initialize renv for your sleepstudy project using renv::init()
  • From the “Files” pane in RStudio, take a look at the renv.lock file
  • Install a new package:
install.packages("cowsay")
  • Actually use the package in one of your scripts (or .Rmd files):
cowsay::say("Hello world", "cow")
  • Write this change to the lockfile using renv::snapshot()
  • Commit and push your changes to GitHub

4. renv – How?

Restoring someone else’s package versions:

  1. Clone or pull the repository from GitHub
  2. Open the RStudio project (e.g. via the projectname.Rproj file)
  3. Use renv::restore() to install the package versions from the renv.lock file

4. renv – Troubleshooting

  • There may be some (inconsequential) warnings when switching between Mac and Windows
  • At least on Windows, you need to have Rtools installed when installing packages that are not on CRAN (https://cran.r-project.org/bin/windows/Rtools/)
  • Installing and loading packages may take a while, especially if your project lives on a network drive
    (such as N:/)

5. Containerization: Docker

5. Docker – What & Why?

  • What it does:
    • Creates a small, Linux-based virtual machine (a container) on your computer
    • Makes it possible to run your scripts (or render your .Rmd files) on this virtual system
    • The recipe to build this system is stored in a Dockerfile that can be shared via GitHub


  • Why it helps:
    • Prevents differences between operating systems, R versions, region and language settings etc.
    • Ensures long-term reproducibility
    • Provides a starting point for cloud-based and high-performance computing (HPC)
    • Pre-packaged Docker images are available for different languages (R, Python, MATLAB, LaTeX etc.)

5. Docker – How?

docker run -d -e PASSWORD=1234 -p 8787:8787 -v /path/to/your/project:/home/rstudio/ rocker/rstudio
  • You can then access RStudio (running in the container) by opening http://localhost:8787 in your web browser (username: rstudio, password: 1234)
  • You can also build your own container by:
    • Choosing a base image from https://hub.docker.com/u/rocker (including the tidyverse, LaTeX etc.)
    • Creating a Dockerfile in your project directory, specifying additional steps to execute when building the container, e.g., install.packages("renv"); renv::restore()

5. Docker – How?

  • Example for a Dockerfile:
# This is a text file stored with the name "Dockerfile" in your project directory.

# Base image from Docker Hub, including R, RStudio, the tidyverse, and LaTeX
FROM rocker/verse:4.0.2

# Set working directory within the container
WORKDIR /home/rstudio

# Install renv
RUN R -e "remotes::install_version('renv', version = '0.12.0', repos = 'http://cran.us.r-project.org')"

# Copy the lock file
COPY renv.lock renv.lock

# Install package versions stored in the lockfile
RUN R -e "renv::consent(provided = TRUE)"
RUN R -e "renv::restore(prompt = FALSE)"

5. Docker – Beyond Docker

  • Some additional tools based on Docker:
    • With Binder (https://mybinder.org) and Code Ocean (https://codeocean.com), you can run your analysis in the cloud; they will even create the Dockerfile for you if you don’t have your own
    • Singularity (https://sylabs.io) is a fully compatible, open source clone of Docker which you can use on systems where you don’t have root access (e.g., on high performance clusters)


6. Where to start?

  • This wealth of tools can seem overwhelming
    • Adopting even one or two of them can help make your code more reproducible
  • An RStudio project and renv are easy to set up even for existing projects
    • And will help a lot to make sure that you can still run your code at re-submission
  • Version control is best tried out with a new (real or toybox) project
    • Create an empty repository on GitHub and use it to create your RStudio project
  • Once you’ve made it this far, full computational reproducibility (by containerizing your project) is just one more step away

Thank you.